Abstract

Resampling Approach for Estimation of Models of Calculation 
Processes and Information Systems. M. Fioshin. The doctoral 
degree thesis. Supervisor Dr.Habil.Sc.Eng., professor A. Andronov. 

The work analyzes the Resampling method proposed by A. Andronov
and the possibility of applying it to the estimation and simulation
of the reliability of calculation and logical systems. The properties
of the Simple and Hierarchical Resampling methods are considered,
and algorithms for calculating the variance of the estimators are
given. The methods are applied to the analysis of processes in a
multitask operating system and of queries to a database, and are
compared with the classical method, which uses empirical
distribution functions. Numerical examples illustrate the
influence of different factors on the efficiency of the Resampling
method.

The problem of sample size optimization is considered.
The dynamic programming method is applied to minimize the
variance of the Resampling estimator. The optimization is applied
to the analysis of queries to a database, and a numerical example
illustrates the gain from optimization.

The case of partially known distributions is considered. It is 
shown how to use the Resampling approach in the case when the 
distributions of some input variables are known. The method 
is applied to database query analysis and a comparison with 
Hierarchical Resampling is made. 

The construction of Resampling confidence intervals is
considered. An algorithm for constructing them is given and the
actual coverage probabilities are calculated. Examples from the
analysis of a multitask operating system illustrate the algorithm
for calculating the actual coverage probability.

1 Importance of the Work Subject

Computer capabilities are developing rapidly. The computer has
become a common instrument of the scientist. It helps in scientific
research and allows us to solve problems that could not be solved
before.

Thus a question arises: how should a computer be used in scientific
research? How can it help a scientist beyond simple calculation
and information storage? Much attention is currently paid to
this question.

When computers appeared, new directions began to develop
in many sciences, aimed at solving the problems of the respective
science by computer. At first these were numerical methods in
mathematics and physics. Later computers came to be used for
problem solving in chemistry, biology, geology, economics and
other sciences.

In the early 1970s computers began to be used in statistics as
well. It was clear that data analysis could be performed efficiently
by computer. But classical statistical methods, like those of other
sciences, are not oriented towards computer application. A classical
method yields a formula, and the formula gives the result after a
small amount of calculation. Applying such a method is relatively
complex: it requires assumptions about the kind of model and
transformations of the model that are difficult to implement on a
computer.

As an alternative to classical methods, a group of statistical
methods appeared, called computer-intensive statistical methods or
computational statistics. Methods of this group are simple and
easy to implement on a computer, but require a large amount of
calculation. Usually they do not require many assumptions about
the model structure or complex data transformations, but the
result is less accurate than with classical methods.

As intensive methods do not require many assumptions about
the model structure, they can be used to solve a wide class
of problems. These methods allow us to analyze data from
different points of view and to discover dependences that were not
seen before. Intensive computer methods allow us to solve problems
that, within the limits of the classical models, cannot be solved or
can be solved only under strong assumptions.

This area is developing rapidly. The amount of information is now
huge, and its analysis has become one of the most important tasks
of computer science. In real situations we need to solve problems
that are difficult to solve with classical statistical methods;
intensive statistical computer methods are used for such problems.

The intensive statistical methods have two sides. On the one
hand, such methods are simple to use. On the other hand, their
accurate analysis is a complex task: it is often more difficult to
analyze a simple intensive method than a complex classical one.
Nevertheless the analysis is required, because without it we cannot
guarantee that the method gives a correct answer, and even when the
answer is correct we cannot estimate its efficiency and accuracy.
Thus the analysis of intensive statistical methods is a topical task.

A new intensive statistical method, called Resampling, is
considered in the work. This method can be applied efficiently
to different statistical tasks, for example statistical estimation,
simulation and the construction of confidence intervals. It can be
applied to the estimation and simulation of different systems,
including information systems. Possible applications of the method
to the estimation of information systems are considered in the
work, and application examples are shown.

One of the main goals of the work is to analyze the efficiency of
the Resampling method when it is applied to the simulation of
information systems. This task is topical, because Resampling
cannot be applied correctly without such analysis.



2 Goal and Tasks of the Work 

The goal of the work is to obtain algorithms for calculating the
efficiency of the Resampling method, to apply the Resampling
method to the estimation of information systems, and to apply
these algorithms to calculate the efficiency of the method.
The main tasks of the work are the following:

• Study the Resampling approach and the fields of its application.

• Using the simple and Hierarchical Resampling methodology,
develop algorithms for applying Resampling to tasks such as
sample size optimization and the case of partially known
distributions.

• Develop algorithms for estimating the efficiency of the method
in these cases.

• Develop algorithms for applying the Resampling method to the
construction of confidence intervals.

• Develop algorithms that allow us to estimate the accuracy of
Resampling confidence intervals.

• Consider the possibility of applying the Resampling method in
the information technology area.

• Using the Resampling methodology, estimate different models
from the information technology area and apply the algorithms
for calculating the efficiency of the estimators.



3 Research Methodology 

The theoretical and methodological basis of the thesis consists of
classical works in computer science, simulation, statistics and
probability theory.

Books, periodical publications and materials of international
conferences in the corresponding areas were used in the thesis.

During the research, examples from information technology were
analyzed to illustrate concrete applications of the developed
methodology. Hypothetical data were used in the examples, chosen
to illustrate the specific character and efficiency of the method
as fully as possible. The variance of the estimator was used as the
efficiency criterion of the method. The dependence of the
efficiency on different factors was analyzed, which allows us to
judge whether the method can be applied in practical situations.

Both analytical and experimental methods were applied to solve
the given problems. Analytical methods gave expressions for
calculating the efficiency of the method in different situations.
Experimental methods gave the values of the efficiency criterion
for concrete numerical examples, which show the influence of
different factors on the accuracy of the method.

4 Scientific Novelty of the Work 

Intensive statistical computer methods include many methods, such
as the Jackknife, Bootstrap and Resampling methods, and allow
us to solve a wide class of problems. The Jackknife method was
proposed by Tukey in 1958 as an estimator that combines an
estimator based on all the data with estimators based on parts of
the data. In 1979 Efron proposed the Bootstrap method, which is in
fact a generalization of the Jackknife.

In 1976 Ivnitsky proposed using Resampling for reliability
estimation problems. Since 1995 this approach has been developed
under the supervision of Prof. Andronov. Andronov considered the
simple and Hierarchical Resampling methods; Andronov, Merkuryev
and Loginova considered the application of the method to
reliability and queuing theory; Andronov, Merkuryev and Fioshin
considered optimization tasks for the Resampling method [2];
Andronov and Fioshin considered the properties of Resampling sums
[1], [3], the case of partially known distributions [4] and
confidence interval construction [5]. At present Andronov and
Afanasjeva are working on the application of the method in
regression analysis.

The application of the Resampling method to the analysis of
information systems has not been studied before. Different
models from the information technology area are analyzed in the
work (multitask operating systems, database queries, reliability
of information storage), different variants of the Resampling
method are examined for these models (simple Resampling,
Hierarchical Resampling, Resampling in the case of partially known
distributions), and different tasks are considered (point
estimation, interval estimation, sample size optimization).

Algorithms for calculating the efficiency of the Resampling method
for the considered models are constructed in the work. These
algorithms can be applied to a wide class of problems and show how
to estimate the efficiency of the method in similar situations.
The results of the work can be used as a basis for developing
Resampling simulation software.

5 The Main Results of the Work 

The main results of the work are the following:

• The methodology of applying the Resampling method, described
in the articles of Andronov, Merkuryev and Fioshin, is considered
for different cases (Hierarchical Resampling, the case of
partially known distributions) and for different tasks
(estimation of the expectation, optimization of sample sizes,
confidence interval construction);

• Models from the information technology area that can be
analyzed with the Resampling method are selected and described;

• It is shown that the Resampling method can be applied to
the estimation and simulation of such models;

• Algorithms for calculating the efficiency criteria of the method
are obtained for each concrete system;

• Different variants of the method are compared for concrete
systems;

• The influence of the system parameters on the efficiency of the
method is analyzed, and conclusions are drawn about the
applicability of the method to the given class of systems.

6 Practical Application of the Work 

The results obtained in the work make it possible to use the
Resampling method for the estimation of information systems. The
work shows how to apply the method and how to calculate its
efficiency, which allows Resampling to be used in practical
simulation. The results also make it possible to build software
that performs Resampling estimation of different systems and
correctly estimates the efficiency of the method, allowing
experiments to be planned correctly.

7 Publications and Participation at Conferences

The results of the work have been presented in 8 publications [1-8]
and discussed at the corresponding conferences.

8 Structure of the Work

The first section of the work gives a short description of
intensive statistical computer methods. The Resampling method is
also described, and the possibilities of applying it to the
estimation of information systems are shown. Each of the following
sections considers one case or task of Resampling application.
Section 2 describes simple Resampling, Section 3 Hierarchical
Resampling, Section 4 the task of sample size optimization,
Section 5 the case of partially known distributions and Section 6
the construction of Resampling confidence intervals. In each
section the task is described and algorithms are given, followed by
the calculation of the efficiency of the method. Examples are then
considered, with numerical results that allow us to compare
different variants of the method and to analyze the influence of
the system properties on the efficiency criteria. Each section ends
with conclusions about the efficiency of the method for the given
case or task.

9 Short Description of the Work Sections 

9.1 Intensive Statistical Computer Methods 

As the work is devoted to the Resampling method, which is one of
the intensive statistical methods, the first section of the work
analyzes intensive statistical methods. At present some authors
consider computational statistics a separate discipline.

It is often difficult to apply traditional statistical methods to
the modeling of complex systems, to non-stationary systems and to
cases where the distributions differ from the classical ones. In
these cases it is better to use intensive computational methods.

The intensive methods are simple and easy to implement. They are
also simple to use, because few assumptions about the model
structure are required.

On the other hand, intensive methods do not give results as
accurate as those of the classical methods. The simplicity of these
methods and the existence of many variants lead to many
implementations and increase the possibility of incorrect use.
One must remember that a large amount of computation does not
necessarily guarantee a correct result.

At present, three main groups of intensive computer statistical
methods are distinguished:

• The Monte Carlo methods. 

• Randomization methods, which include cross-validation and 
the jackknife method. 

• Resampling methods. 

A brief description of each group of methods is given in the
section, followed by a general description of the proposed
Resampling method and of its possible application areas in
information technology.

The Resampling method can be used for the following problems
in the information technology area:

• Database design and performance analysis. 

• Software reliability. 

• Server performance and efficiency analysis. 

• Multitask operating systems work optimization. 

• Network analysis and optimization. 

• Information protection. 

• Information storage device reliability analysis and information 
backup. 






The Resampling method can be applied successfully to the analysis
of a system that has the following properties:

• Little input statistical information is available.

• The analyzed events are relatively rare.

• The type of the distributions of the system random variables is
unknown.

• The functional dependence on the initial data is known.

9.2 Resampling Point Estimation of Calculation System Models

Suppose we have independent random variables $X_1, X_2, \ldots, X_m$.
The distribution functions $F_i(x)$ of these variables are unknown,
but a sample $H_i = \{X_{i1}, X_{i2}, \ldots, X_{in_i}\}$ is available
for each variable $X_i$, $i = 1, \ldots, m$.

Suppose a known function $\phi(x_1, x_2, \ldots, x_m)$ of $m$ real
arguments is given. The task is to estimate the expectation $\theta$
of the function $\phi$ whose arguments are the random variables
$X_1, X_2, \ldots, X_m$:

$$\theta = E_{F_1, F_2, \ldots, F_m}\, \phi(X_1, X_2, \ldots, X_m). \qquad (1)$$

Traditional estimation methods usually use the so-called
"plug-in" procedure: instead of the true distribution functions
$F_i(x)$, their estimators $\hat F_i(x)$ are used (often the
empirical distribution functions). The estimator $\hat\theta$ of
$\theta$ is then

$$\hat\theta = E_{\hat F_1, \hat F_2, \ldots, \hat F_m}\, \phi(X_1, X_2, \ldots, X_m). \qquad (2)$$

The idea of the Resampling method is the following. We select at
random one element from each sample $H_i$. Suppose that at step
$l$ the element with number $j_i(l)$ is extracted from sample
$H_i$. We form a vector $X(l)$ from the elements extracted at step
$l$: $X(l) = (X_{1j_1(l)}, X_{2j_2(l)}, \ldots, X_{mj_m(l)})$.

Let us repeat this procedure $r$ times, obtaining realizations
$X(1), X(2), \ldots, X(r)$. The estimator $\theta^*$ of $\theta$ is
the average of the function $\phi$ over all $r$ realizations:

$$\theta^* = \frac{1}{r} \sum_{l=1}^{r} \phi(X(l)). \qquad (3)$$

It is proved that the estimator $\theta^*$ is unbiased: $E\,\theta^* = \theta$.
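
The procedure behind formula (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the work: elements are drawn uniformly and with replacement from each sample, and the names `resampling_estimate`, `phi`, `samples` and the data are hypothetical.

```python
import random
from typing import Callable, Sequence


def resampling_estimate(
    phi: Callable[..., float],
    samples: Sequence[Sequence[float]],
    r: int,
    rng: random.Random,
) -> float:
    """Average phi over r resampled argument vectors, as in formula (3)."""
    total = 0.0
    for _ in range(r):
        # Assumption: one element is drawn uniformly from each sample H_i,
        # independently of the previous steps.
        x = [rng.choice(h) for h in samples]
        total += phi(*x)
    return total / r


# Hypothetical execution times of m = 3 processes (unequal sample sizes).
samples = [[2.0, 3.5, 4.1], [1.2, 0.8, 1.9, 1.1], [5.0, 6.2]]
rng = random.Random(0)

# phi = sum corresponds to sequential processes, phi = max to parallel ones.
theta_sum = resampling_estimate(lambda *x: sum(x), samples, r=1000, rng=rng)
theta_max = resampling_estimate(lambda *x: max(x), samples, r=1000, rng=rng)
```

The function `phi` is passed in as an argument, so the same routine covers any known function of the sampled variables.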

Let us take the variance of the estimator $\theta^*$ as the
efficiency criterion of the method. Let $\mu = E\,\phi(X(l))$,
$\mu_2 = E\,\phi(X(l))^2$ and $\mu_{11} = E\,\phi(X(l))\,\phi(X(l'))$,
where $l \neq l'$ are realization numbers. Using the properties of
variance, we have

$$D\,\theta^* = \frac{1}{r}\bigl(\mu_2 + (r-1)\,\mu_{11}\bigr) - \mu^2. \qquad (4)$$

Only the mixed moment $\mu_{11}$ depends on the element extraction
rules.
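
As a small numerical check of formula (4), the sketch below evaluates the variance from the three moments. The function name and the numbers are hypothetical. The example uses the special case $\mu_{11} = \mu^2$ (uncorrelated realizations), in which formula (4) reduces to $(\mu_2 - \mu^2)/r$.

```python
def resampling_variance(mu: float, mu2: float, mu11: float, r: int) -> float:
    """Variance of the Resampling estimator according to formula (4)."""
    return (mu2 + (r - 1) * mu11) / r - mu ** 2


# Hypothetical moments: mu = 2, mu2 = 5, and mu11 = mu^2 = 4.
d_uncorrelated = resampling_variance(mu=2.0, mu2=5.0, mu11=4.0, r=10)

# With r = 1 only mu2 matters, and the result is the variance of phi itself.
d_single = resampling_variance(mu=2.0, mu2=5.0, mu11=4.5, r=1)
```

When $\mu_{11} > \mu^2$, which happens when two realizations can share sample elements, the variance does not vanish as $r$ grows but tends to $\mu_{11} - \mu^2$.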

In order to calculate $\mu_{11}$, we use the notion of an
$\omega$-pair. We say that the vectors $j(l)$ and $j(l')$ produce an
$\omega$-pair if $j_i(l) = j_i(l') \Leftrightarrow i \in \omega$; in
other words, the set $\omega$ contains the indices of the samples
from which the same element was extracted in $X(l)$ and $X(l')$. For
example, the vectors (2, 1, 4, 2) and (2, 2, 4, 1) produce the
{1, 3}-pair.
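
The definition can be checked mechanically. The helper below is a hypothetical illustration (the name `omega_pair` does not come from the work); it returns the set of 1-based positions at which two index vectors coincide.

```python
from typing import Sequence, Set


def omega_pair(j1: Sequence[int], j2: Sequence[int]) -> Set[int]:
    """Return the set omega of 1-based positions where j1 and j2 coincide."""
    if len(j1) != len(j2):
        raise ValueError("index vectors must have the same length")
    return {i for i, (a, b) in enumerate(zip(j1, j2), start=1) if a == b}


# The example from the text: positions 1 and 3 coincide.
omega = omega_pair((2, 1, 4, 2), (2, 2, 4, 1))
```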

Let $\mu_{11}(\omega)$ be the conditional mixed moment given that
an $\omega$-pair takes place, and let $P\{\omega\}$ be the
probability of obtaining an $\omega$-pair. Then $\mu_{11}$ can be
calculated as follows:

$$\mu_{11} = \sum_{\omega} P\{\omega\}\, \mu_{11}(\omega). \qquad (5)$$
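
The sum in formula (5) can be sketched as follows. The probabilities $P\{\omega\}$ depend on the extraction rules, which are not fixed here. The sketch assumes that each element is drawn uniformly and independently with replacement, so that positions coincide independently with probability $1/n_i$. The function names and the callback `mu11_given` are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Sequence


def omega_probabilities(sizes: Sequence[int]) -> Dict[FrozenSet[int], float]:
    """P{omega} for every subset omega of {1..m}.

    Assumption: uniform, independent draws with replacement, so position i
    coincides in two realizations with probability 1 / n_i.
    """
    m = len(sizes)
    probs: Dict[FrozenSet[int], float] = {}
    for k in range(m + 1):
        for omega in combinations(range(1, m + 1), k):
            p = 1.0
            for i, n in enumerate(sizes, start=1):
                p *= 1.0 / n if i in omega else 1.0 - 1.0 / n
            probs[frozenset(omega)] = p
    return probs


def mixed_moment(
    sizes: Sequence[int],
    mu11_given: Callable[[FrozenSet[int]], float],
) -> float:
    """Formula (5): sum of P{omega} * mu11(omega) over all omega."""
    return sum(p * mu11_given(w) for w, p in omega_probabilities(sizes).items())


probs = omega_probabilities([2, 4])  # two samples of sizes 2 and 4
```

The enumeration visits $2^m$ subsets, so this direct form is practical only for a small number of arguments $m$.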

Example 1: The reaction time of an information system. Suppose we
have a calculation system whose reaction time depends on some
parameter $X$ ($X$ can be the size of the input data of an
algorithm, the size of the database for a database management
system, the number of processes in a computer when the next process
is created, etc.). We suppose that $X$ is a random variable whose
distribution $F(x)$ is unknown, but a sample
$H = \{X_1, X_2, \ldots, X_n\}$ of realizations of $X$ is available.

In this case the function $\phi$ depends on one argument $x$. The
task is to estimate the expectation of this function:

$$\theta = E\,\phi(X). \qquad (6)$$

We can use the Resampling method to estimate $\theta$. The variance
of the estimator $\theta^*$ is calculated and compared with the
variance of the classical estimator; the results are shown in
Table 1. For the sample sizes shown, the variance of the Resampling
estimator exceeds the variance of the classical estimator by up to
28%, but Resampling is simpler to apply than the classical methods.



Table 1. Variance dependence on sample size n

Sample size n   Classical estimator        Resampling estimator
                variance $D\,\hat\theta$   variance $D\,\theta^*$
 1              781.25                     781.25
 2              390.625                    398.437
 3              260.417                    270.833
 5              156.25                     168.75
 8              97.6562                    111.328
10              78.125                     92.1875
13              60.0962                    74.5192
15              52.0833                    66.6667

Example 2: Sequential processes. Suppose a task consists of $m$
sequential processes. The random variables $X_1, X_2, \ldots, X_m$
are the process execution times. The distributions $F_i(x)$ of these
variables are unknown; only a sample $H_i$ of execution times is
available for each process. We need to estimate the average
execution time of the task.

In this case the function $\phi$ is the sum of the variables $X_i$.
We need to estimate the expectation of this sum:

$$\theta = E\,(X_1 + X_2 + \ldots + X_m). \qquad (7)$$

Formulas for calculating $\mu_{11}$ are obtained. The dependence of
the variance on different parameters is analyzed, different cases
are compared and it is shown that the method is relatively
effective for this task. The dependence of the variance on the
sample size $n$ is shown in Fig. 1.

Figure 1: Variance dependence on sample size n

Example 3: Parallel processes. Suppose a task consists of $m$
parallel processes. The random variables $X_1, X_2, \ldots, X_m$ are
the process execution times. The distributions $F_i(x)$ of these
variables are unknown; only a sample $H_i$ of execution times is
available for each process. We need to estimate the average
execution time of the task.

In this case the function $\phi$ is the maximum of the variables
$X_i$. We need to estimate the expectation of this function:

$$\theta = E\,\max(X_1, X_2, \ldots, X_m). \qquad (8)$$

Formulas for calculating $\mu_{11}$ are obtained. The dependence of
the variance on different parameters is analyzed, different cases
are compared and it is shown that the method is relatively
effective for this task.

Table 2. Variance dependence on sample size n

Sample    Resampling estimator variance,      Classical estimator
size n    number of resamples                 variance $D\,\hat\theta$
          r = 10      r = 20      r = 30
 1        9.02778     9.02778     9.02778     9.02778
 2        4.96528     4.73958     4.66435     4.51389
 3        3.61111     3.31019     3.20988     3.00926
 5        2.52778     2.16667     2.0463      1.80556
 8        1.9184      1.52344     1.39178     1.12847
10        1.71528     1.30903     1.17361     0.902778
12        1.57986     1.16609     1.02816     0.752315
15        1.44444     1.02315     0.882716    0.601852

Example 4: Reliability of information storage. Suppose the
information storage consists of 3 redundant devices. We say that
the system is reliable if at least 2 of the 3 devices work. The
working times of the devices before failure are independent random
variables $X_1$, $X_2$ and $X_3$. The distributions $F_1(x)$,
$F_2(x)$ and $F_3(x)$ of the device working times are unknown; only
the samples $H_1$, $H_2$ and $H_3$ are available. The task is to
estimate the probability that the system is reliable at time $t$.

In this case the function $\phi_t(x_1, x_2, x_3)$ is the indicator
function that equals 1 if the system works at time $t$ and 0 if the
system has failed, given that the working times of the elements are
$x_1, x_2, x_3$ respectively. The function $\phi_t$ can be defined
as follows:

$$\phi_t(x_1, x_2, x_3) = \begin{cases} 1 & \text{if at least 2 elements of } \{x_1, x_2, x_3\} \text{ are } > t, \\ 0 & \text{otherwise.} \end{cases} \qquad (9)$$

The goal is to estimate the expectation $\theta_t$ of this function:

$$\theta_t = E\,\phi_t(X_1, X_2, X_3). \qquad (10)$$

It is clear that $\theta_t$ is the probability that the system is
reliable at time $t$.
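
The indicator function (9) and a Resampling estimate of (10) can be sketched as follows. This is an illustration only. Uniform draws with replacement, the function names and the data are assumptions, not part of the work.

```python
import random
from typing import Sequence


def phi_t(x1: float, x2: float, x3: float, t: float) -> int:
    """Indicator (9): 1 if at least 2 of the 3 working times exceed t."""
    working = sum(1 for x in (x1, x2, x3) if x > t)
    return 1 if working >= 2 else 0


def estimate_reliability(
    samples: Sequence[Sequence[float]], t: float, r: int, rng: random.Random
) -> float:
    """Resampling estimate of theta_t in (10) from three samples H_1, H_2, H_3."""
    total = 0
    for _ in range(r):
        # Assumption: one working time is drawn uniformly from each sample.
        x1, x2, x3 = (rng.choice(h) for h in samples)
        total += phi_t(x1, x2, x3, t)
    return total / r


# Hypothetical working times before failure of the three devices.
samples = [[120.0, 95.0, 210.0], [80.0, 150.0, 175.0], [60.0, 300.0]]
theta_100 = estimate_reliability(samples, t=100.0, r=2000, rng=random.Random(0))
```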

Formulas for calculating $\mu_{11}$ are obtained. The dependence of
the variance on different parameters is analyzed, different cases
are compared and it is shown that the method is relatively
effective for this task. The dependence of the variance on the time
$t$ is shown in Fig. 2.

Figure 2: Variance dependence on time t (curves for n = 5, 10 and 15)



9.3 Hierarchical Resampling for the Point Estimation of Hierarchical Calculation Systems

Hierarchical Resampling has the following advantages over simple
Resampling:

• The method simplifies the estimation of complex systems that
consist of subsystems.

• The method allows us to perform parallel calculations for the
analysis of subsystems.

• The method allows sample sizes to be optimized.

• The method can be applied to the analysis of complex information
systems, such as hierarchical queries to databases, enterprise
databases, hierarchical server structures etc.



Suppose the function $\phi(x_1, x_2, \ldots, x_m)$ can be represented
using subfunctions $\phi_v(\cdot)$. The result of a subfunction is
used as the value of an argument of a higher-level function. In
this case the function $\phi(x_1, x_2, \ldots, x_m)$ can be
represented by a calculation tree.

The input variables $X_1, X_2, \ldots, X_m$ correspond to the leaves
of the tree. The remaining nodes are intermediate, and intermediate
functions $\phi_v(\cdot)$ correspond to them. The result of each
function is taken as an argument of the function on the next higher
level. The function $\phi(x_1, x_2, \ldots, x_m)$ corresponds to the
root of the tree. An example of a calculation tree is presented in
Fig. 3.




Figure 3: Calculation tree 

A sample $H_v$ corresponds to each node $v$. During the simulation
the samples are constructed iteratively, level by level. The final
estimator $\theta^*$ of $\theta$ is the average of the values at the
root of the tree:

$$\theta^* = \frac{1}{n_k} \sum_{i=1}^{n_k} Y_{k,i}, \qquad (11)$$

where $Y_{k,i}$ are the elements of the sample $H_k$ at the root $k$.

Let us take the variance $D\,\theta^*$ of the estimator $\theta^*$
as the efficiency criterion of the method. The variance calculation
is based on the $\omega$-pair definition. The probabilities of
$\omega$-pairs and the conditional mixed moments $\mu_{11}(\omega)$
are calculated iteratively, by tree levels.
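
The level-by-level construction of samples and the root average of formula (11) can be sketched as below. This is a minimal illustration under assumptions, not the algorithm of the work: each node sample is built by drawing uniformly with replacement from the samples of its children, the recursion visits children before parents (which is equivalent to processing the tree by levels), and the class `Node`, the function names and the data are hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """A node of the calculation tree; leaves carry an observed sample."""
    func: Optional[Callable[..., float]] = None
    children: List["Node"] = field(default_factory=list)
    sample: List[float] = field(default_factory=list)
    size: int = 0  # sample size n_v for intermediate nodes and the root


def build_sample(node: Node, rng: random.Random) -> List[float]:
    """Construct the sample H_v of a node from the samples of its children."""
    if not node.children:
        return node.sample  # leaf: the sample is given
    child_samples = [build_sample(c, rng) for c in node.children]
    node.sample = [
        node.func(*(rng.choice(s) for s in child_samples))
        for _ in range(node.size)
    ]
    return node.sample


def hierarchical_estimate(root: Node, rng: random.Random) -> float:
    """Formula (11): average of the values in the root sample."""
    values = build_sample(root, rng)
    return sum(values) / len(values)


def leaf(values: List[float]) -> Node:
    return Node(sample=list(values))


# Hypothetical tree: max(max(x1, x2), min(x3, x4), x5 + x6).
root = Node(
    func=max,
    size=200,
    children=[
        Node(func=max, size=50, children=[leaf([1.0, 2.5]), leaf([2.0, 0.7])]),
        Node(func=min, size=50, children=[leaf([3.0, 1.4]), leaf([4.0, 2.2])]),
        Node(func=lambda a, b: a + b, size=50,
             children=[leaf([0.5, 1.0]), leaf([0.6, 1.5])]),
    ],
)
theta_star = hierarchical_estimate(root, random.Random(0))
```

Because each subtree is sampled independently of the others, the three intermediate samples in this example could be built in parallel.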

Example 1: Hierarchical query to a database. Suppose we have a
query to a database that consists of 6 subqueries. The working time
of subquery $i$ is a random variable $X_i$, $i = 1, \ldots, 6$. The
distributions $F_i(x)$ of the subquery working times are unknown;
only a sample $H_i$ is available for each $i$.

The query is executed on 3 processors (or database servers). The
execution rules are the following:

• The 1st and 2nd subqueries are executed on the 1st processor,
and they are executed in parallel;

• The 3rd and 4th subqueries are executed on the 2nd processor,
and they are also executed in parallel, but the 2nd processor
ends its work when one of the subqueries gives a result;

• The 5th and 6th subqueries are executed on the 3rd processor,
and they are executed sequentially.

The task is to estimate the probability $\theta_t$ that the query
will have finished its work by time $t$, which can be written as
follows:

$$\theta_t = P\{\max(\max(X_1, X_2), \min(X_3, X_4), X_5 + X_6) < t\} = E\,\phi_t(X_1, \ldots, X_6). \qquad (12)$$

Formulas for calculating $\mu_{11}$ are obtained. The dependence of
the variance on different parameters is analyzed, different cases
are compared and it is shown that the method is relatively
effective for this task. The dependence of the variance on the time
$t$ is shown in Fig. 4.



Figure 4: Comparison of simple and Hierarchical Resampling methods

Example 2: Sequential-parallel query to a database. Suppose we have
a query to a database whose subqueries are organized in blocks. All
subqueries in one block are executed in parallel. A block gives its
result when all subqueries in the block have given their results.
The query gives its result when the first block gives its result.

Let us assume that all subqueries inside a block have the same
working time distribution. The distribution function $F_i(x)$ of
the subquery working time in block $i$ is unknown; only a sample
$H_i$ is available. Only one sample is available for each block.

The goal is to estimate the probability $R(t)$ that the query
working time is greater than $t$: $R(t) = P\{X > t\}$, where $X$ is
the working time of the query. If the distributions of the subquery
working times are known, then $R(t)$ can be calculated in the
following way:

$$R(t) = \prod_{i=1}^{n} \bigl(1 - F_i(t)^{m_i}\bigr), \qquad (13)$$

where $F_i(t)$ is the distribution function of the subquery working
time in block $i$, $m_i$ is the number of subqueries in block $i$
and $n$ is the number of blocks.

If we use the empirical distribution function $\hat F_i(t)$ to
estimate $R(t)$, we get the following estimator:

$$R^*(t) = \prod_{i=1}^{n} \bigl(1 - \hat F_i(t)^{m_i}\bigr). \qquad (14)$$

It is shown that the estimator (14) is biased. The dependence of
the bias on the time $t$ is shown in Table 3.

The Resampling method gives an unbiased estimator for this task.
An algorithm for calculating the estimator is obtained. Table 4
shows the variance of the Resampling estimator depending on the
time $t$.
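
The contrast between the plug-in estimator and a Resampling estimator for this example can be sketched as follows. Everything here is an assumption made for illustration: the work does not spell out its algorithm at this point, so the sketch draws the elements of one block without replacement from the block's single sample (so that they are distinct observations), which requires each sample to contain at least as many elements as the block has subqueries. The names `block_sizes`, `plug_in_estimate` and `resampling_reliability` and the data are hypothetical.

```python
import random
from typing import Sequence


def plug_in_estimate(
    samples: Sequence[Sequence[float]], block_sizes: Sequence[int], t: float
) -> float:
    """Plug-in estimator: empirical distribution functions in the product."""
    result = 1.0
    for h, m in zip(samples, block_sizes):
        f_hat = sum(1 for x in h if x <= t) / len(h)  # empirical F_i(t)
        result *= 1.0 - f_hat ** m
    return result


def resampling_reliability(
    samples: Sequence[Sequence[float]],
    block_sizes: Sequence[int],
    t: float,
    r: int,
    rng: random.Random,
) -> float:
    """Fraction of resampled queries whose working time exceeds t."""
    hits = 0
    for _ in range(r):
        # A block finishes when its slowest subquery finishes (max);
        # the query finishes when the first block finishes (min).
        block_times = [
            max(rng.sample(list(h), m)) for h, m in zip(samples, block_sizes)
        ]
        if min(block_times) > t:
            hits += 1
    return hits / r


# Hypothetical data: two blocks with 2 and 3 subqueries.
samples = [[0.4, 1.1, 0.9, 2.3], [0.2, 0.8, 1.7, 1.2, 0.6]]
block_sizes = [2, 3]
r_plug_in = plug_in_estimate(samples, block_sizes, t=1.0)
r_resampled = resampling_reliability(samples, block_sizes, 1.0, 5000, random.Random(0))
```

Raising the empirical distribution function to a power is a nonlinear operation, which is the source of the bias of the plug-in estimator; drawing distinct elements avoids that step.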

Table 3. The expectation and bias (%) of the traditional estimator
R*(t) depending on time t

t      R(t)     E R*(t)   Bias, %
0.1    0.999    0.992      1
0.2    0.987    0.960      3
0.3    0.954    0.901      6
0.5    0.818    0.723     13
0.7    0.629    0.523     20
0.9    0.443    0.349     27
1      0.362    0.278     30
1.5    0.108    0.076     42
2      0.026    0.018     50
3      0.001    0.001     61

Table 4. Variance of the Resampling estimator

t      D R*(t)
0.1    0.079
0.2    0.081
0.3    0.085
0.5    0.098
0.7    0.108
0.9    0.110
1      0.108
1.5    0.091
2      0.082

9.4 Discrete Optimization of Resampling Sample Sizes

In many practical tasks we need to give recommendations on the
sample sizes $n_i$. Clearly, if the sample sizes are selected
automatically, they must be optimal.

As the efficiency criterion of the Resampling method is the
variance, we need to select the values $n_i$ that minimize it.
Suppose each element of sample $H_i$ has weight $a_i$,
$i = 1, \ldots, m$, and the total weight is bounded by $b$. Our task
is to solve the following optimization problem:

minimize $D(n_1, n_2, \ldots, n_k)$ (15)

subject to

$a_1 n_1 + a_2 n_2 + \ldots + a_k n_k \le b$, (16)

where $b$, $\{a_i\}$ and $\{n_i\}$ are non-negative integers and
$D(n_1, n_2, \ldots, n_k)$ is the variance of the estimator, which
depends on the sample sizes.
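
To show how dynamic programming handles a problem of the form (15)-(16), the sketch below solves a deliberately simplified version. It is not the tree-based Bellman recursion of the work: it assumes that the variance is separable, $D = \sum_i c_i / n_i$, with hypothetical constants $c_i$, positive integer weights and every $n_i \ge 1$. The function name and the data are hypothetical as well.

```python
from typing import List, Sequence, Tuple


def optimize_sample_sizes(
    c: Sequence[float], a: Sequence[int], b: int
) -> Tuple[float, List[int]]:
    """Minimize sum(c[i] / n[i]) subject to sum(a[i] * n[i]) <= b, n[i] >= 1.

    Assumption: separable variance and positive integer weights a[i].
    """
    if any(w < 1 for w in a):
        raise ValueError("weights must be positive integers")
    k = len(c)
    inf = float("inf")
    # best[i][z]: minimal variance of the first i samples with total weight <= z
    best = [[inf] * (b + 1) for _ in range(k + 1)]
    choice = [[0] * (b + 1) for _ in range(k + 1)]
    best[0] = [0.0] * (b + 1)
    for i in range(1, k + 1):
        for z in range(b + 1):
            n = 1
            while a[i - 1] * n <= z:
                cand = c[i - 1] / n + best[i - 1][z - a[i - 1] * n]
                if cand < best[i][z]:
                    best[i][z] = cand
                    choice[i][z] = n
                n += 1
    if best[k][b] == inf:
        raise ValueError("budget b is too small for one element per sample")
    # Recover the optimal sizes by walking back through the stored choices.
    sizes = [0] * k
    z = b
    for i in range(k, 0, -1):
        sizes[i - 1] = choice[i][z]
        z -= a[i - 1] * sizes[i - 1]
    return best[k][b], sizes


d_min, n_opt = optimize_sample_sizes(c=[4.0, 1.0], a=[1, 1], b=3)
```

In this small example the budget allows three elements in total, and the more variable first sample receives two of them.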

In order to solve this optimization problem, we use the dynamic
programming method. Let us consider the function

$$\psi_i(\alpha) = \alpha\, D\,X_i + (1 - \alpha)\, \mathrm{Cov}(X_i, X_i'), \quad i = 1, 2, \ldots, k, \quad 0 \le \alpha \le 1. \qquad (17)$$
It can be proved that






It also can be proved that 



d , , x\ , / \ — a 



'4^v{(y) = ^{-^4>vM\ ^ifaH )• (19) 

We can see that the variance $D\,\theta^*$ can be obtained as

$$D\,\theta^* = \psi_k(0). \qquad (20)$$

The values $\psi_v(\alpha)$ depend on the sample sizes $n_i$ of all
subnodes. Let us denote the set of indices of these subnodes by
$B_v$ and write $\psi_v(\alpha) = \psi_v(\alpha; n_i : i \in B_v)$.

Then the Bellman function, which must be calculated, can be
written in the following way:

$$\Phi_v(\alpha, z) = \min \psi_v(\alpha; n_i : i \in B_v), \qquad (21)$$

where the minimization is over non-negative integer variables
$n_i$ that satisfy the restriction

$$\sum_{i \in B_v} a_i n_i \le z. \qquad (22)$$

It can be proved, that the Bellman function can be represented 
in the following way: 

$„(a,2) = min^ (-— 0„(/i„)) $i ( a + -^zA, (23) 

where the minimization is over non-negative integer variables
$n_v$ and $\{z_i : i \in I_v\}$ that satisfy the restriction

$$a_v n_v + \sum_{i \in I_v} z_i \le z. \qquad (24)$$

Finally, the minimal variance $D^*\theta^*$ is equal to

$$D^*\theta^* = \Phi_k(0, b). \qquad (25)$$

In order to calculate the optimal sample sizes
$n_1^*, n_2^*, \ldots, n_k^*$, we need to use the "forward" procedure
of dynamic programming.
Example: Subquery sample size optimization. Suppose we have a
query to a database that consists of 6 subqueries, as in Example 1
of Section 9.3. The execution time of subquery $i$ is a random
variable $X_i$, $i = 1, \ldots, 6$. The distributions $F_i(x)$ of
these times are unknown; only a sample $H_i$ is available for each
$i$.

The task is to estimate the expectation of the query working
time:

$$\theta = E\,\max(\max(X_1, X_2), \min(X_3, X_4), X_5 + X_6). \qquad (26)$$

Derivatives of all subfunctions $\phi_i$ are calculated. The Bellman
functions $\Phi_v(\alpha, z)$ are constructed, formulas (21), (22),
(23) and (24) are applied iteratively, and formula (25) is applied
to get an optimal solution.

The results obtained are shown in Table 5. We can see that the
method allows us to decrease the variance of the estimator by
13-44%.



Table 5: Optimization results

λ                           n*                          D*      D       %
(0.1,0.7,0.2,0.4,0.8,0.5)   (3,3,9,2,2,4,4,9,4,10)      3.37    4.30    22%
(0.2,0.2,0.4,0.4,0.8,0.8)   (6,6,3,3,3,3,8,4,4,20)      6.03    6.95    13%
(0.2,0.3,1.0,1.2,0.5,0.3)   (4,4,1,1,9,3,6,1,10,11)     12.59   17.88   30%
(1.2,0.1,0.3,2.1,0.1,1.5)   (1,1,4,1,12,1,1,4,12,13)    7.64    13.61   44%



9.5 Point Estimation of Calculation Systems in the Case 
of Partially Known Distributions 

Suppose the distributions of some variables are known. Variables
$X_1, X_2, \ldots, X_m$ are given, the distributions of which are
unknown (only samples $H_i$ are available), and also variables
$Z_1, Z_2, \ldots$ are given, the distributions of which are known
(functions $F_i(x)$ are given). The function $\varphi$ depends on the vectors $X$ and
$Z$.

The task is to estimate the expectation of the function $\varphi$, the
arguments of which are the random variables $X$ and $Z$:

$$\theta = E\,\varphi(X, Z). \qquad (27)$$

The question is the following: how can the knowledge about $Z$
be used in the most efficient way?

The idea is to use the Hierarchical Resampling method, but to draw
samples from the known distribution functions. Two situations are possible:

• It is possible to calculate the distribution of the subfunction
$F_{v,x}(y) = P\{\varphi_v(x, Z) \le y\}$ in the tree node;

• It is impossible to calculate the distribution of the subfunction in
the tree node.

In the first situation a sample of functions $F_{v,l}(y)$ is constructed
in each node, where $l$ is the step number; $x$ does not appear in the index because
it is extracted from the subsamples. The calculation tree is shown in
Fig. 5.

At the end the estimator $\theta^*$ is calculated by the formula

$$\theta^* = \frac{1}{r} \sum_{l=1}^{r} \int_{-\infty}^{\infty} y \, dF_{k,l}(y). \qquad (28)$$

In the second situation we use the $N$-dimensional vector $Y_{v,l}$ instead
of the function $F_{v,l}(y)$. In order to construct this vector, we select
vectors $Y(l, \xi) = (Y_{j,l(j),\xi})$ from the subsamples. Then for each
$l = 1, 2, \ldots, n_v$ and $j$ we construct $N$ random variables
$\{Z_{j,l,\xi} : \xi = 1, 2, \ldots, N\}$. Then we construct $n_v \cdot N$ vectors
$Z(l, \xi) = (Z_{j,l,\xi})$, calculate the values

$$Y_{v,l,\xi} = \varphi_v(Y(l, \xi), Z(l, \xi))$$

and construct the vectors $Y_{v,l} = (Y_{v,l,\xi} : \xi = 1, 2, \ldots, N)$. This procedure
is shown in Fig. 6.

Figure 5: The case of a known subfunction distribution

Figure 6: The case of an unknown subfunction distribution

At the end the estimator $\theta^*$ is calculated by the formula

$$\theta^* = \frac{1}{rN} \sum_{l=1}^{r} \sum_{\xi=1}^{N} Y_{k,l,\xi}. \qquad (29)$$
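The averaging in (29) can be sketched as follows for the simplest case in which the calculation tree is collapsed to a single node, which is an assumption made here for brevity. At each of the r steps one element is drawn from every sample of the variables with unknown distributions, and N values of the variables with known distributions are simulated. The names `estimate_partially_known` and `z_samplers` are hypothetical.

```python
import random

def estimate_partially_known(samples, z_samplers, phi, r, n_sim, rng):
    """Return 1/(r*N) * sum over l and xi of phi(X(l), Z(l, xi)).

    samples    -- observed samples H_i (distributions unknown)
    z_samplers -- callables rng -> value, one per variable with a known law
    """
    total = 0.0
    for _ in range(r):                      # resampling steps l = 1..r
        x = [rng.choice(h) for h in samples]
        for _ in range(n_sim):              # simulated copies xi = 1..N
            z = [draw(rng) for draw in z_samplers]
            total += phi(x, z)
    return total / (r * n_sim)

# Usage: one observed sample, one exponential variable (illustrative values).
value = estimate_partially_known(
    [[0.4, 1.3, 0.8]],
    [lambda g: g.expovariate(1.0)],
    lambda x, z: max(x[0], z[0]),
    r=200, n_sim=20, rng=random.Random(7),
)
```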

Example: Hierarchical query to database. Let us consider the
same query to a database as in Example 1 of Section 9.3, but
with partially known distributions of the subquery working times. We
suppose that the distribution functions $F_2(t)$, $F_4(t)$ and $F_6(t)$ of the
2nd, 4th and 6th subquery working times $X_2'$, $X_4'$ and $X_6'$ are
known, but the distribution functions $F_1(t)$, $F_3(t)$ and $F_5(t)$ of the
1st, 3rd and 5th subquery working times $X_1'$, $X_3'$ and $X_5'$ are
unknown, and only samples $H_1$, $H_3$ and $H_5$ are available. The task
is to estimate the probability $R(t) = \theta_t$ that at the time moment $t$
the query will end its work.

In order to follow the above-mentioned notation, let us denote
$X_1 = X_1'$, $X_2 = X_3'$, $X_3 = X_5'$; $Z_1 = X_2'$, $Z_2 = X_4'$, $Z_3 = X_6'$. Then
our goal is to estimate the expectation of the function $\varphi_t$, where $\varphi_t$
is the following function:

$$\varphi_t(X, Z) = \begin{cases} 1, & \text{if } \min\{\max\{X_1, Z_1\}, X_2, Z_2, X_3 + Z_3\} > t, \\ 0, & \text{else.} \end{cases} \qquad (30)$$

In this case we have the first situation, when the conditional
expectation $\tilde{\varphi}(X_1, X_2, X_3) = \tilde{\varphi}(X_1', X_3', X_5')$ is known. It can be
calculated as follows:

$$\tilde{\varphi}(X_1', X_3', X_5') = \begin{cases}
0, & \text{if } X_3' < t, \\
\bar{F}_2(t)\,\bar{F}_4(t)\,\bar{F}_6(t - X_5'), & \text{if } X_3' > t,\ X_1' < t,\ X_5' < t, \\
\bar{F}_4(t)\,\bar{F}_6(t - X_5'), & \text{if } X_3' > t,\ X_1' > t,\ X_5' < t, \\
\bar{F}_2(t)\,\bar{F}_4(t), & \text{if } X_3' > t,\ X_1' < t,\ X_5' > t, \\
\bar{F}_4(t), & \text{if } X_3' > t,\ X_1' > t,\ X_5' > t,
\end{cases} \qquad (31)$$

where $\bar{F}_i(t) = 1 - F_i(t)$.

Formulas for fin value calculation are obtained. Variance
dependence on different parameters is analyzed, different cases are
compared and it is shown that the method is relatively effective for
solving this task. The comparison of the Resampling with unknown
and partially known distributions is shown in Fig. 7.
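The conditional expectation of the indicator (30) for fixed values of the variables with unknown distributions can be sketched as below. It is assumed here that the working times are non-negative and independent, and that the caller supplies the tail probabilities P{Z_i > s} of the variables with known distributions; the names `phi_t`, `conditional_phi` and `tails` are hypothetical.

```python
import math

def phi_t(x, z, t):
    """Indicator (30): 1 if min(max(x1, z1), x2, z2, x3 + z3) > t, else 0."""
    x1, x2, x3 = x
    z1, z2, z3 = z
    return 1.0 if min(max(x1, z1), x2, z2, x3 + z3) > t else 0.0

def conditional_phi(x, t, tails):
    """E[phi_t(x, Z)] for independent non-negative Z1, Z2, Z3.

    tails -- three callables with tails[i](s) = P{Z_(i+1) > s} (assumed input).
    """
    x1, x2, x3 = x
    tail1, tail2, tail3 = tails
    if x2 <= t:                                 # the minimum cannot exceed t
        return 0.0
    p_first = 1.0 if x1 > t else tail1(t)       # P{max(x1, Z1) > t}
    p_sum = 1.0 if x3 > t else tail3(t - x3)    # P{x3 + Z3 > t}
    return p_first * tail2(t) * p_sum

# Usage with exponential tails of rate 1 (an illustrative choice).
def exp_tail(s):
    return math.exp(-s) if s > 0 else 1.0

p = conditional_phi((0.5, 2.0, 0.3), 1.0, (exp_tail, exp_tail, exp_tail))
```

Averaging `conditional_phi` over resampled triples then replaces the simulation of the variables with known distributions.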



Figure 7: Comparison of Hierarchical Resampling method for unknown and
partially known distributions



9.6 Resampling Interval Estimation of Logical Systems 

So far we have considered point Resampling estimators. However, in
many practical tasks it is important to know an interval which
contains the value of the parameter with a given probability. In this
case we have to deal with interval estimation.

Suppose we have a function $\varphi(X_1, X_2, \ldots, X_m)$ of $m$ random
variables. The task is to construct a confidence interval
with level $\gamma$ for the expectation $\theta =
E\,\varphi(X_1, X_2, \ldots, X_m)$ of this function.

Using the Resampling method, we can obtain an estimate
$\theta^*$ of the expectation of the function $\varphi$. We make $r$ such
realizations $(\theta^*_1, \theta^*_2, \ldots, \theta^*_r)$.
We order this sequence, obtaining the order statistics
$\theta^*_{(1)}, \theta^*_{(2)}, \ldots, \theta^*_{(r)}$.
We accept $(\theta^*_{(\lfloor \alpha r \rfloor)}, \infty)$ as the $1 - \alpha$ upper confidence interval for the
parameter $\theta$. Here $\lfloor \alpha r \rfloor$ denotes the greatest integer which
is less than or equal to $\alpha r$.
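The construction of the interval from the ordered realizations can be sketched as follows. The function name `confidence_bound` is hypothetical, and the realizations in the usage line are made-up numbers.

```python
import math

def confidence_bound(realizations, alpha):
    """Left end of the upper interval: the floor(alpha * r)-th order statistic."""
    r = len(realizations)
    k = math.floor(alpha * r)         # 1-based index of the order statistic
    if k < 1:
        raise ValueError("alpha * r must be at least 1")
    return sorted(realizations)[k - 1]

# Usage: r = 10 realizations and alpha = 0.2 select the 2nd smallest value.
bound = confidence_bound([0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0], 0.2)
```

Since `alpha * r` is computed in floating point, an exact integer product may occasionally be rounded down by one; integer arithmetic would avoid this if it matters.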

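Because the components of the vector $(\theta^*_{(1)}, \theta^*_{(2)}, \ldots, \theta^*_{(r)})$ are dependent, the
coverage probability of the parameter $\theta$ by the interval $(\theta^*_{(\lfloor \alpha r \rfloor)}, \infty)$
differs from $1 - \alpha$. The task is to calculate the actual coverage
probability

$$R = P\{\theta^*_{(\lfloor \alpha r \rfloor)} \le \theta\}. \qquad (32)$$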

The method is described in the paper of Andronov and Fioshin [5].
Suppose the function $\varphi(X_1, X_2, \ldots, X_m)$ depends only on the order
of the $X_i$, not on their actual values. The idea is to fix this order
and to find the conditional probability $R = P\{\theta^*_{(\lfloor \alpha r \rfloor)} \le \theta\}$ on the
condition that the order is given. The disadvantage of this approach
is the large dimension of the task, because the number of different
orderings can be large. In order to decrease the dimension, the
notion of a protocol is proposed.

First let us describe the protocol in the
case of 2 dimensions. Suppose the function $\varphi(x_1, x_2)$ depends on 2
arguments. We have 2 samples $H_1 = (X_{11}, X_{12}, \ldots, X_{1n_1})$ and $H_2 =
(X_{21}, X_{22}, \ldots, X_{2n_2})$. Let us order both samples and count how
many elements of the second sample lie between neighboring elements
of the first sample:

$$c_j = \#\{X_{2i} : X_{1(j)} < X_{2i} < X_{1(j+1)}\}, \qquad (33)$$

where $\#X$ means the number of elements in the set $X$.

We can find the probability of each such protocol. If we know
the protocol, we can calculate the conditional coverage probability
and then the coverage probability $R$.
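The counts in (33) can be computed as in the following sketch. The two end cells (elements below the smallest and above the largest element of the first sample) are an assumption about the boundary convention, ties are assumed not to occur, and the name `protocol_2d` is hypothetical.

```python
from bisect import bisect_left

def protocol_2d(h1, h2):
    """Counts c_j of elements of h2 between neighbouring order statistics of h1.

    counts[0] holds the elements below min(h1) and counts[len(h1)] those
    above max(h1); both end cells are an assumed convention.
    """
    ordered = sorted(h1)
    counts = [0] * (len(ordered) + 1)
    for x in h2:
        # bisect_left gives the number of elements of h1 smaller than x.
        counts[bisect_left(ordered, x)] += 1
    return counts

# Usage with illustrative numbers.
c = protocol_2d([2.5, 6.3, 1.0], [0.5, 4.7, 3.1, 0.2, 5.2])
```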

In the multidimensional case the protocol is defined in an analogous
way. Suppose the function $\varphi(x_1, x_2, \ldots, x_m)$ depends on $m$
arguments; we have $m$ samples $H_i$. We order the elements of all samples
together and record the number of the sample to which each element belongs:

$$c_j = i \iff X_{(j)} \in H_i. \qquad (34)$$

For example, if $H_1 = (2.5, 6.3, 1)$, $H_2 = (0.5, 4.7)$, $H_3 = (3.1, 0.2, 5.2)$,
then the ordered sequence
is $(0.2, 0.5, 1, 2.5, 3.1, 4.7, 5.2, 6.3)$ and $c = (3, 2, 1, 1, 3, 2, 3, 1)$. We
can calculate the probability of each such protocol, the conditional
coverage probability and the coverage probability $R$.
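The protocol (34) of the example above can be reproduced with a few lines of code. This is only a sketch; the function name `protocol` is hypothetical, and ties between elements are assumed not to occur.

```python
def protocol(samples):
    """c_j = i when the j-th smallest pooled element belongs to sample H_i."""
    pooled = [(x, i) for i, h in enumerate(samples, start=1) for x in h]
    pooled.sort()
    return [i for _, i in pooled]

# The three samples from the example in the text.
c = protocol([[2.5, 6.3, 1], [0.5, 4.7], [3.1, 0.2, 5.2]])
# c == [3, 2, 1, 1, 3, 2, 3, 1]
```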

Example 1: Minimal-time process selection. Consider an
information system which controls processes. It is known that the
optimal strategy of such a system is to execute the shortest processes
first.

Suppose we have $m$ processes in the system. We suppose that
the process execution times are independent random variables
$X_1, X_2, \ldots, X_m$. The distributions $F_1(x), F_2(x), \ldots, F_m(x)$ are
unknown; only samples $H_1, H_2, \ldots, H_m$ are available,
one for each $X_i$.

We suppose that the system selects the process whose execution
time is predicted to be minimal; the system assigns the number
$m$ to this process. This means that the system supposes that $X_m <
\min(X_1, X_2, \ldots, X_{m-1})$. Our task is to estimate the probability of the
correct selection:

$$\theta = P\{X_m < \min(X_1, X_2, \ldots, X_{m-1})\}. \qquad (35)$$

We also need to construct the upper confidence interval for $\theta$
with a given confidence level $\gamma$.

The corresponding protocols were constructed. The probability
of each protocol was calculated and the conditional coverage probability
was found. This allowed us to find the actual coverage probability $R$. The
results of the calculation are presented in Table 6.
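To illustrate why only the order of the sample elements matters in this example, the sketch below averages the indicator of correct selection over all combinations of one element per sample. This full enumeration is a simple stand-in written for illustration and is not the algorithm of [5]; the name `selection_estimate` is hypothetical.

```python
from itertools import product

def selection_estimate(samples):
    """Fraction of combinations (one element per sample) in which the
    element of the last sample is strictly smaller than all the others."""
    hits = total = 0
    for combo in product(*samples):
        total += 1
        if combo[-1] < min(combo[:-1]):
            hits += 1
    return hits / total

# Usage with illustrative samples; the last sample plays the role of X_m.
data = [[2.5, 6.3, 1], [0.5, 4.7], [3.1, 0.2, 5.2]]
estimate = selection_estimate(data)

# An increasing transformation keeps the order, hence also the estimate.
cubed = [[x ** 3 for x in h] for h in data]
same = selection_estimate(cubed)
```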

Example 2: Process ordering. Suppose that, as in the previous
example, we have an information system which controls processes.
The system orders the processes by their estimated execution times
and assigns the corresponding numbers to the ordered processes: this
means that the system supposes that $X_1 < X_2 < \cdots < X_m$. The
goal is to estimate the probability of the correct ordering

$$\theta = P\{X_1 < X_2 < \cdots < X_m\}. \qquad (36)$$

We also need to construct the upper confidence interval for $\theta$
with a given confidence level $\gamma$.

Table 6: Actual coverage probabilities

                  Coverage probability R
(n_1,n_2,n_3)     γ=0.5    γ=0.6    γ=0.7    γ=0.8    γ=0.9
(3,3,3)           0.533    0.576    0.625    0.686    0.770
(9,9,3)           0.519    0.571    0.630    0.701    0.793
(4,4,4)           0.521    0.578    0.640    0.709    0.797
(6,6,4)           0.516    0.576    0.642    0.715    0.807
(5,5,5)           0.515    0.579    0.646    0.722    0.817
(3,3,8)           0.516    0.581    0.651    0.728    0.823
(4,4,7)           0.512    0.580    0.652    0.732    0.830


The corresponding protocols were constructed. The probability
of each protocol was calculated and the conditional coverage probability
was found. This allowed us to find the actual coverage probability $R$. The
results of the calculation are presented in Table 7.



Table 7: Actual coverage probabilities

                  Coverage probability R
(n_1,n_2,n_3)     γ=0.5    γ=0.6    γ=0.7    γ=0.8    γ=0.9
(3,3,3)           0.593    0.635    0.680    0.730    0.803
(9,9,3)           0.524    0.595    0.675    0.762    0.862
(4,4,4)           0.540    0.606    0.677    0.757    0.848
(6,6,4)           0.525    0.600    0.678    0.766    0.864
(5,5,5)           0.523    0.601    0.682    0.770    0.866
(3,3,8)           0.536    0.604    0.678    0.760    0.855
(4,4,7)           0.522    0.600    0.681    0.769    0.866



Conclusions 



In the present work the properties of the Resampling method
were analyzed and the possibility of its application to the
estimation of information systems was studied. Different
cases and tasks of the Resampling method application were analyzed, such as simple
Resampling, Hierarchical Resampling, Resampling in the case of
partially known distributions, sample size optimization and confidence
interval construction.

For each of the mentioned situations or tasks the methodology 
and algorithms of the Resampling method application were shown. 
It was shown how to calculate the values of the method efficiency 
criteria. 

For each of the mentioned tasks or situations examples from
the information systems area were analyzed, and the Resampling
method was applied for the estimation of the systems. For each class of
tasks the methodology of the Resampling method application
was shown, algorithms were obtained for the calculation of the method
efficiency, a number of examples illustrate the influence of
different factors on the efficiency of the method, and a comparison
of various methods was made.

From the obtained results it is possible to conclude that the
Resampling method can be a good alternative to the classical
methods in the analysis of information systems.

The methodology obtained in the present work and the
other results can serve as a basis for software that performs system
simulation and estimation using the Resampling approach.